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1. Introduction 

In this paper, we provide several examples of 
application of the principal manifold and graphs 
methodology taken from applied projects: comparative 
political science, data analysis in molecular biology, 
theoretical methods of dynamical systems analysis. As it 
will be shown, one of the most common problems in 
these fields is how to approximate a finite set D in R"* 
for relatively large m by a finite subset of a regular low- 
dimensional object in R"*. 

The first hypothesis we have to check is: whether the 
dataset D is situated near a low-dimensional affine 
manifold (plane) in R". If we look for a point, straight 
line, plane, ... that minimizes the average squared 
distance to the datapoints, we immediately come to the 
Principal Component Analysis (PC A). 

PCA is one of the most seminal inventions in data 
analysis. Now it is textbook material. Nonlinear 
generalization of PCA is a great challenge, and many 
attempts have been made to emswer it. In 1982 Kohonen 
introduced a type of neural networks called Self- 
Organizing Maps (SOM)'. 

With the SOM algorithm we take a finite metric 
space Y with metric p (a space of "neurons") and try to 
map it into R"* with (a) the best preservation of initial 



structure in the image of Y and (b)the best 
approximation of the dataset D. The SOM algorithm has 
several setup variables to regulate the compromise 
between these goals. We start from some initial 
approximation of the map, cp^M^i?"". On each (^-th) 

step of the algorithm we have a datapoint x€l D and a 
current approximation (pk'M-^R'". For these x and (pk we 

define an "owner neuron" of x in 7: argminy^Y 
||x-(pj(y)||. The next approximation, (p^+z, is 

(p/t+/(F) = ^ify')+hk w(p(y,yM^'Ptiy))- (1) 

Here hk is a step size, 0<w(fl(y,y^))<l is a 
monotonically decreasing cutting function and p(y,yx) is 
distance between y and y^. There are many ways to 
combine steps (1) in the whole algorithm. The idea of 
SOM is very flexible and seminal, has plenty of 
apphcations and generahzations, but, strictly speaking, 
we don't know what we are looking for: we have the 
algorithm, but no independent definition: SOM is a 
result of the algorithm work. The attempts to define 
SOM as solution of a minimization problem for some 
energy functional were not very successful^. 
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For a known probability distribution, principal 
manifolds were introduced as lines or surfaces passing 
through "the middle" of the data distribution^. 

This intuitive vision was transformed into the 
mathematical notion of self-consistency: every point x 
of the principal manifold M is a conditional expectation 
of all points z that are projected into x. Neither 
manifold, nor projection should be linear: just a 
differentiable projection n of the data space (usually it is 
R"^ or a domain in R"') onto the manifold M with the 
self-consistency requirement for conditional 
expectations: x = E(z|7i(z) = x). For a finite dataset D, 
only one or zero datapoints are typically projected into a 
point of the principal manifold. In order to avoid 
overfitting, we have to introduce smoothers that become 
an essential part of the principal manifold construction 
algorithms. 

SOMs give the most popular approximations for 
principal manifolds: we can take for Y a fragment of a 
regular A:-dimensional grid and consider the resulting 
SOM as the approximation to the A:-dimensional 
principal manifold (see, for example, Refs. 4-5). Several 
original algorithms for construction of principal curves^ 
and surfaces for finite datasets were developed during 
last decade, as well as many applications of this idea. In 
1996, in a discussion about SOM at the 5* Russian 
National Seminar in Neuroinformatics, a method of 
multidimensional data approximation based on elastic 
energy minimization was proposed (see Refs. 7-14 and 
the bibliography there). This method is based on the 
analogy between the principal manifold and the elastic 
membrane (and plate). Following the metaphor of 
elasticity, we introduce two quadratic smoothness 
penalty terms. This allows one to apply standard 
minimization of quadratic functionals (i.e., solving a 
system of linear algebraic equations with a sparse 
matrix). 

2. Method of principal elastic graplis and principal 
elastic maps 

2.1. Principal graphs and manifolds 

Here we give a short description of the structure of 
the elastic functional used for construction of principal 
graphs and manifolds. For more detailed description, 
see other works (in particular, Ref 14). 

In a series of works (see Refs. 7-16), the authors of 
this paper used metaphor of elastic membrane and plate 
to construct one-, two- and three-dimensional principal 
manifold approximations of various topologies. Mean 
squared distance approximation error combined with the 
elastic energy of the membrane serves as a functional to 



be optimised. The elastic map algorithm is extremely 
fast at the optimisation step due to the simplest form of 
the smoothness penalty. It is implemented in several 
programming languages as software libraries or front- 
end user graphical interfaces freely available from the 
web -site http://bioinfo.curie.fr/proiects/vidaexpert . The 
software found applications in microarray data analysis, 
visualization of genetic texts, visualization of 
economical and sociological data and other fields^ '*. 

Let G be a simple undirected graph with set of 
vertices Kand set of edges E. 

Definition, k-star in a graph G is a subgraph with k 
+ 1 vertices Vq.i t e Fand k edges {(vq, v,)|/ = 1, ... k} 

Definition. Suppose that for each k >2, a family St 
of A:-stars in G has been selected. Then we define an 
elastic graph as a graph with selected families of A:-stars 

Sk and for which for all g £ and 5*'' e St, the 

corresponding elasticity moduli 1, > and fitj > are 
defined. 

Definition. Primitive elastic graph is an elastic 
graph in which every non-terminal node (with the 
number of neighbours more than one) is associated with 
a A:-star formed by all neighbours of the node. All k- 
stars in the primitive elastic graph are selected, i.e. the 
St sets are completely determined by the graph 
structure. 

Definition, Let £*'\0), £*'\l) denote two vertices of 
the gi-aph edge and (0),..., 5<" (/t) denote 

vertices of a A:-star (where 5^.'' (0) is the central 
vertex, to which all other vertices are connected). Let us 
consider a map (j):K— >R"' which describes an 
embedding of the graph into a multidimensional space. 
The elastic energy of the graph embedding in the 
Euclidean space is defined as 

U^{G):=Ui{G) + Ui{G), (2) 
Ui{G) := Z A,|ki?<'*(0)) - (t>{E''\\))t , (3) 

UiiG) := E II <l^{S)pm-\i<P{SY\i)) f • (4) 

Definition. Elastic map is a continuous manifold 
FeR'" constructed from the elastic net as its grid 
approximation using some between-node interpolation 
procedure. This interpolation procedure constructs a 
continuous mapping (t)i,..., (t)dim(G)} — >■ R"' from the 
discrete map (|): F — > R'", used to embed the graph in R"', 

and the discrete values of node indices Wv^^dimcc)} > 
/=1...|F|. For example, the simplest piecewise linear 
elastic map is build by piecewise linear map 

Definition. Elastic principal manifold of dimension 
s for a dataset X is an elastic map, constructed from an 
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elastic net Y of dimension s embedded in R™ using such 
a map (|)opt:i'^ R"', that corresponds to the minimal 
value of the functional 

i7<*(X,7) = MSDw(X,F) + i7«*(G), (5) 

where the weighted mean squared distance from the 
dataset Xto the elastic net Y is calculated as the distance 
to the finite set of vertices {y^=(|)(vi),..., y*=<|)(vk)}. 

In the Euclidean space one can apply an EM 
algorithm for estimating the elastic principal manifold 
for a finite dataset. It is based in turn on the general 
algorithm for estimating the locally optimal embedding 
map <j) for an arbitrary elastic graph G (see Ref. 14). 

2.2. Pluriharmonic graphs as ideal approximators 

Approximating datasets by one dimensional 
principal curves is not satisfactory in the case of 
datasets that can be intuitively characterized as 
branched. A principal object which naturally passes 
through the 'middle' of such a data distribution should 
also have branching points that are missing in the 
simple structure of principal curves. Introducing such 
branching points converts principal curves into 
principal graphs. 

In Refs. 10,12,14 it was proposed to use a universal 
form of non-linearity penalty for the branching points. 
The form of this penalty is defined in the previous 
section (4) for the elastic energy of graph embedment. It 
naturally generalizes the simplest three-point second 
derivative approximation squared: for a 2-star (or rib) 
the penalty equals 

II <^(5<^>(0))- i(^(5<^'(l)) + «^(5<^'(2)))||^ , for a 3-star it 

is II (Ksi^Ko)) - ^(^(^3*^'(i)) + ^(^3*^'(2)) + (Ksl^Km \f , 

etc. 

For a A:-star this penalty equals to zero iff the 
position of the central node coincides with the mean 
point of its neighbors. An embedment (|)(G) is 'ideal' if 
all such penalties equal to zero. For a primitive elastic 
graph this means that this embedment is a harmonic 
function on graph: its value in each non-terminal vertex 
is a mean of the value in the closest neighbors of this 
vertex. 

For non-primitive graphs we can consider stars 
which include not all neighbors of their centers. For 
exeunple, for a square lattice we create elastic graph 
(elastic net) using 2-stars (ribs): all vertical 2-stars and 
all horizontal 2-stars. For such elastic net, each non- 



boundary vertex belongs to two stars. For a general 
elastic graph G with sets of ^-stars Sj^ we introduce the 
following notion of pluriharmonic function. 

Definition. A map ^:V-^ R™ defined on vertices of 
G is pluriharmonic iff for any A:-star S['^ e S,^ with the 

central vertex S['^ (0) and the neighbouring vertices 
S['^ (i), i = \ ...k, the equality holds: 

^(Si'\0)) = jmsi^^(i))- (6) 

k 1=1 

Pluriharmonic maps generalize the notion of linear 
map and of harmonic map, simultaneously. For 
example: 

1) ID harmonic functions are linear; 

2) If we consider an « -dimensional cubic lattice as a 
primitive graph (with 2«-stars for all non-boundary 
vertices), then the correspondent pluriharmonic 
functions are just harmonic ones; 

3) If we create from «-dimensional cubic lattice the 
standard «-dimensional elastic net with 2-stars (each 
non-boundary vertex is a center of n 2-stars, one 2-star 
for each coordinate direction), then pluriharmonic 
functions are linear. 

In the theory of principal curves and manifolds the 
penalty functions were introduced to penahse deviation 
from linear manifolds (straight lines or planes). We 
proposed to use pluriharmonic embeddings as "ideal 
objects" instead of manifolds and to introduce penalty 
(5) for deviation from this ideal form. 

2.3. Complexity of principal graphs 

The principal graphs can be called data 
approximators of controllable complexity. By 
complexity of the principal objects we mean the 
following three notions: 

1) Geometric complexity: how far a principal object 
deviates from its ideal configuration; for the elastic 
principal graphs we explicitly measure deviation from 
the 'ideal' pluriharmonic graph by the elastic energy 
U^{G) (3) (this sort of complexity is often just a 
measure of non-linearity); 

2) Structural complexity measure: it is some non- 
decreasing function of the number of vertices, edges and 
^stars of different orders ?,C{G)=?,C{\V\,\E\,\S2\,...,\SJ); 
this function penalises for number of structural 
elements; 
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Fig. 1. Approximation of data used for constructing the quality of life index. Each of 192 points represents a country in 4- 
dimensional space formed by the values of 4 indicators (gross product, life expectancy, infant mortality, tuberculosis incidence). 
Different forms and colors correspond to various geographical locations. Red line represents the principal curve, approximating the 
dataset. Three arrows show the basis of principal components. The best linear index (first principal component) explains 76% of 
variation, while the non-linear index (principal curve) explains 93% of variation. 



3) Construction complexity is defined with respect to a 
graph grammar as a minimum number of apphcations of 
elementary transformations necessary to construct given 
G from the simplest graph (one vertex, zero edges). 

The construction complexity is defined with respect 
to a grammar of elementary transformation. The graph 
grammars'^ '^ provide a well-developed formalism for 
the description of elementary transformations. An 
elastic graph grammar is presented as a set of 
production (or substitution) rules. Each rule has a form 
A B, where A and B are elastic graphs. When this rule 
is apphed to an elastic graph, a copy of A is removed 
fi'om the graph together with all its incident edges and is 
replaced with a copy of B with edges that connect B to 
the graph. For a full description of this language we 
need the notion of a labeled graph. Labels are necessary 
to provide the proper connection between B and the 
graph"*. An approach based on graph grammars to 
constructing effective approximations of elastic 
principal graph was proposed recently'" '^ ''*. 



Let us define graph grammar O as a set of graph 
grammar operations 0={oi,..,Os}. All possible 
applications of a graph grammar operation o, to a graph 
G gives a set of transformations of the initial graph 
o,(G) = {G|, G2, Gp}, where p is the number of all 
possible applications of o, to G. Let us also define a 
sequence of r different graph grammars 

Let us choose a grammar of elementary 
transformations, predefined boundaries of structural 
complexity 5Cn,ax and construction complexity CC^mx , 
and elasticity coefficients Xj and i^^, . 

Definition^ Elastic principal graph for a dataset X is 
such an elastic graph G embedded in the Euclidean 
space by the map (j): K-» R'" that SC(G) < SC^^^ , CC(G) 
< CCmax , and U^{G) min over all possible elastic 
graphs G embeddings in R'" . 

For the simplest choice of grammar, this definition 
gives us principal trees (see the section 3.3). 
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Fig. 2. Comparing linear (b) and non-linear (a) two-dimensional projections of data on the expression of genes in normal human 
tissues. Each point represents a tissue sample (brain, heart, pancreas, etc., each denoted by a distinct color) by a vector of expression 
of several thousands of genes. The linear projection is done by the standard PCA approach. The non-linear projection is made by the 
elastic map method, described in this paper. 



3. Real-life examples of principal graph and 
manifold applications 

3.1. Happiness (quality of life) is non-linear 

Let us take a simple real-life example where 
application of linear data approximation is not 
satisfactory and non-linear one-dimensional principal 
manifolds (principal curves) are needed to adequately 
describe the data. 

In the Political World Atlas project, launched by the 
Moscow State Institute of International Relations", 
systematic quantitative data on 192 modem countries 
for the 1989-2005 period were collected. One of the 
goals was to estimate the "objective" position of post- 
sovetic countries, in particular, in terms of "objective" 
estimates of the quality of life. Similarly to other 
projects on quantification of country happiness or life 
quality, the following quantitative indicators could be 
used for constructing an integral index and ranking: 
gi^oss product per capita in dollars, hfe expectancy in 
years, infant mortality in cases per 10000 habitants, 
tuberculosis incidence in cases per 100000 habitants. A 
common practice in this field is to combine these or 
similar features in one index (number) which usually 



represents a linear combination of features with weights 
determined by experts (see, for example, Ref 20). This 
approach is inevitably biased by the experts' opinion. 
An unbiased approach would be to use the best one- 
dimensional approximation of data to introduce the 
"objective" ranking. It makes sense if a good 
approximation function exists. 

Each of 192 countries is characterized by the 4 
values, thus, we can represent this dataset as a cloud of 
192 points in 4-dimensional space. Simple visualization 
of this distribution allows to make a conclusion that no 
linear function exists that can equally well serve for 
reducing the dimension of this space from 4 to 1 (see 
Fig.l). The distribution is intrinsically curved, hence, 
any linear mapping will inevitably give strong 
distortions in one or other region of dataspace. 

There is a simple reason for this. Observing Fig.l, it 
is easy to realize that all countries can be roughly 
separated in two groups. First group consists of very 
wealthy countries, mostly from Western and Nothem 
Europe, USA, Australia and some others (right branch 
of the distribution on the Fig. 1). It happens that the most 
of variation among these countries can be attributed to 
the gross product per capita feature while others are 
approximately equal for them and do not contribute 
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significantly to the variance. The second group (left 
branch of the distribution on the Fig.l) consists of very 
poor countries (mostly, African), which are "equally 
poor" in terms of the gross product per capita but can be 
very different in terms of their problems (lower of 
higher life expectancy, level of infectious diseases, 
conditions of the state health system), for a number of 
reasons (wars, difference in internal politics, etc.). 
However, this classification is not perfect: in fact, there 
are many countries which are localized in the non-linear 
junction between these two branches of the 4- 
dimensional disfribution. This intermediate group 
includes the most of the post-sovetic countries which 
are the subject of the politologic analysis in this study, 
hence, it is important to correctly position them in the 
global picture. 

To do this, we have constructed a principal curve 
approximating the curved data distribution (red line on 
the Fig. 1). The advantage of using the principal curve 
instead of principal line was significant in term of 
Mean-Squared Error (93% vs 76% of explained 
variance). Importantly, using the non- linear index for 
non-linear one-dimensional dimension reduction can 
significantly change the absolute and relative rankings 
for many countries, including Russia. Thus, for the 
index, constructed using the first principal component, 
Russia in 2005 is ranked the 86* (with the most 
prosperous Luxembourg at the T' place), after, for 
example, Iran (74*) and Egypt (84*). For non-linear 
index, Russia is ranked the 7r' before Iran (77*) and 
Egypt (85*). 

3.2. Dimension reduction for microarrays 

Now let us consider application of two-dimensional 
principal manifolds for data visualization. This is their 
natural application since they permit to create a 
mapping from multidimensional space of data to two- or 
three- dimensional spaces, thus, providing a possibility 
to create visual images of multidimensional 
distributions. 

DNA microarray data is a rich source of information 
for molecular biology (for a recent overview, see 
Ref. 21). This technology found numerous applications 



in understanding various biological processes including 
cancer. It allows simultaneous screening of the 
expression of all genes in a cell exposed to some 
specific conditions (for example, stress, cancer, 
treatment, normal conditions). Obtaining a sufficient 
number of observations (chips), one can construct a 
table of "samples vs genes", containing logarithms of 
the expression levels of, typically several thousands («) 
of genes, in typically several tens {m) of samples. 

For this study, we took three distinct microarray 
datasets, provided at 

http://www.math.le.ac.uk/people/aq153/homepaqe/PrincManLeicAuq2006.htm : 

1) Breast cancer dataset from Ref 22, 286 tumor 
samples and 17816 genes, several natural groupings of 
samples accordingly to survival of patients, status of 
some cell receptors and molecular classification of cells. 

2) Bladder cancer dataset from Ref 23, 40 tumor 
samples, several natural groupings of samples 
accordingly to the stage and grade of the tumor 

3) Collection of microarrays for normal human 
tissues from Ref 24, 103 samples of healthy tissues, 
grouped accordingly to the tissue of origin (brain, heart, 
pancreas, etc.). 

In Fig. 2 we compare data visualization scatters after 
projection of the normal tissues dataset, provided in 
Ref 24, onto the linear two-dimensional and non-linear 
two-dimensional principal manifold. The later one is 
constructed by the elastic maps approach. Each point 
here represents a sample of tissue, labeled by the tissue 
name. Before dimension reduction it is represented as a 
vector in R", containing the expression values for all n 
genes in the sample. 

One of the conclusions that can be made from the 
Fig. 2 is that the non-linear mapping is capable of 
resolving the structure of the data distribution in more 
details. In particular, some groups of points that are not 
separated on the linear scatter become separated on the 
non-linear one. We have decided to quantify this effect, 
thus, comparing the quality of linear and non-linear 
mappings of the same dimension. Four criteria, each 
characterizing a projection of dataset into the low- 
dimensional space were developed: 
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1) Mean-Squared Error (MSE). This is simply the 
mean squared distance from all of the points to the 
manifold, where the projection is orthogonal (by the 
shortest distance). MSE is measured in % of total 
variance. Notice that non-linear approximations should 
give smaller MSE by construction, since they are less 
rigid than the linear ones. 

2) Quality of distance mapping (QDM). This is a 
correlation coefficient between the pair-wise distances 
between data points before and after {d^j) 
projection onto the manifold: 



QDM= corr(dij,di) 



(7) 



For estimating QDM measure we also proposed to 
calculate the correlation coefficient only on "the most 
representative" subset of pair-wise distances, selected 
by a procedure which we have called "Natural PCA". 
The procedure was introduced in Ref 13. The first 
"natural" component is defined by the pair of the most 
distant points (;l,7l). The second component is such a 
pair of points {i2,J2) that 1) i2 is the most distant point 
to the first component, where the distance fi'om a point 
to a set of points S is defined as the distance to the 
closest point in S; 2) J2 is the point from the first 
component which is the closest to i2. All the next 
components are introduced analogously: the nth 
component (/„, j„) is such a pair of points that the /„ is 
the most distant from the union 5„.i of all components 
from 1 to «-l and J„ is the point from this union ^n.i 
which is the closest to /„. We use both Pearson and 
Spearman correlation coefficients for calculating QDM. 
3) Quality of point neighborhood preservation (QNP). 
For measuring it, for every data point / we calculate the 
size of the intersection of the set of k neighbours in the 
multi -dimensional space S(i;k) and in the low- 
dimensional space S(i;k) . QNP is the average value of 
this intersection divided by k: 



QNPk =\|kYi=^..N\S(i■,k)r^S{i■,k)\|N . 



(8) 



4) Quality of group compactness (QGC). In this test we 
assume that there is a label C(/) associated with every 
point /. Then, for each label B, we can calculate the 
average number of points of the same color in the k- 
neighborhood of the points before and after projection. 
Let us define c{i; k) as the number of points in the k- 



A/5£- = X||' 




Dataset 


ELMAP2D 


PCI 


PC2 


PC3 


PC4 


PCS 


PCIO 


Breast cancer MSE 


48.1 


52.6 


.50.7 


49.3 


48.4 


47.6 


45.3 


Variation explained 




r.9% 


14.3% 


19.0'X 


22.D?; 


24.0% 


31.5% 


Bladder cancer MSE 


40,e 


48. r 


45.4 


42.8 


40.r 


39.2 


33.0 


Variation explainetl 




21,0% 


31.4%: 


38-9% 


44.8% 


48.8%, 


G3,8%, 


Normal tissues MSE 


3fi.9 


48.8 


45.5 


42.3 


40.1 


38.5 


32.4 


Variation explained 




10.7% 


19.1% 


2e.o% 


30.:!% 


32.2%. 


40.7% 
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neighbourhood of the point / having the label C(i). 
Then, for a label B, 



QGC J, (B) = l/k Ic(o=s c(/; k)/N (B) , 



(9) 
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where N{B) is the number of points having the label B. 
Values of QGC close to one would indicate that the 
points of the label B forms a very compact and 
separated from the other colors group. Comparing 
QGC(B) before and after projection into the low- 
dimensional space, one can conclude on what happened 
with the group of points of the label B. 

It is clear from the Fig. 3-6 that the non-linear two- 
dimensional principal manifolds provide systematically 
better results accordingly to all four criteria, achieving 
the performance of three- and four- dimensional linear 
principal manifolds. 

A special attention should be made to the 
performance of the non-linear principal manifolds with 
respect to the QGC criterion. It works particularly well 
for the collection of normal tissues. There are cases 
when neither linear nor non-linear low-dimensional 
manifolds could put together points of the same class 
and there are a few examples when linear manifolds 
perform better. In the latter cases (Breast cancer's A, B, 
lumA, lumB and "unclassified" classes, bladder cancer 
Tl class), almost all class compactness values are close 
sto the estimated random values which means that these 
classes have big intra-class dispersions or are poorly 
separated from the others. In this case the value of class 
compactness becomes unstable (look, for example, at 
the classes A and B of the breast cancer dataset) and 
depends on random factors which can not be taken into 
account in this framework. 

The closer class compactness is to unity, the easier 
one can construct a decision function separating this 
class from the others. However, in the high-dimensional 
space, due to many degrees of freedom, the "class 
compactness" might be compromised and become better 
after appropriate dimension reduction. In Fig. 6 one can 
find examples when dimension reduction gives better 
class compactness in comparison with that calculated in 
the initial multi-dimensional space (breast cancer basal 
subtype, bladder cancer Grade 2 and Tl, T2+ classes). It 
means that sample classifiers can be regularized by 
dimension reduction using PCA-like methods. 

There are several particular cases (breast cancer 
basal subtype, bladder cancer T2+, Grade 2 subtypes) 
when non-linear manifolds give better class 
compactness than both the multidimensional space and 
linear principal manifolds of the same dimension. In 
these cases we can conclude that the dataset in the 
regions of these classes is naturally "curved" and the 
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application of non-linear techniques for classification 
regularization is an appropriate solution. 

We can conclude that non-linear principal manifolds 
provide systematically better or equal resolution of class 
separation in comparison with linear manifolds of the 
same dimension. They perform particularly well when 
there are many small and relatively compact 
heterogeneous classes (as in the case of normal tissue 
collection). 

3.3. Principal trees and Metro Maps 

Let us demonstrate how the idea of graph grammars 
(section 2) allows constructing the simplest non-trivial 
type of the principal graphs, called principal trees^^'^'^. 
For this purpose let us introduce a simple 'Add a node, 
bisect an edge' graph grammar (see Fig. 7) apphed to 
the class of primitive elastic graphs. 



Fig. 7. Illustration of the simple "add node to a node" or 
"bisect an edge" graph grammar, a) We start with a simple 2- 
star from which one can generate three distinct graphs shown. 
The "Opl" operation is adding a node to a node, operations 
"Opl" and "Op2" are edge bisections (here they are 
topologically equivalent to adding a node to a terminal node of 
the initial 2-star). For illustration let us suppose that the "Op2" 
operation gives the biggest elastic energy decrement, thus it is 
the "optimal" operation, b) From the graph obtained one can 
generate 5 distinct graphs and choose the optimal one. c) The 
process is continued until a definite number of nodes are 
inserted. 



Definition. Principal tree is an acyclic primitive 
elastic principal graph. 

Definition. 'Add a node, bisect an edge ' graph 
grammar 0^^'°"'^ applicable for the class of primitive 
elastic graphs consists of two operations: 1) The 
transformation "add a node" can be applied to any 
vertex v of G: add a new node z and a new edge (v, z); 
2) The transformation "bisect an edge " is applicable to 
any pair of graph vertices v, v ' connected by an edge (v, 
v'): delete edge (v, v'), add a vertex z and two edges, (v, 
z) and (z, v'). The transformation of the elastic structure 



(change in the star list) is induced by the change of 
topology, because the elastic graph is primitive. 
Consecutive application of the operations from this 
grammar generates trees, i.e. graphs without cycles. 

Definition^ 'Remove a leaf, remove an edge ' graph 
grammar (^f'''""*' applicable for the class of primitive 
elastic graphs consists of two operations: 1) The 
transformation 'remove a leaf can be applied to any 
vertex v of G with connectivity degree equal to 1: 
remove v and remove the edge (v,v') connecting v to the 
tree; 2) The transformation 'remove an edge' is 
applicable to any pair of graph vertices v, v ' connected 
by an edge (v, v'): delete edge (v, v'), delete vertex v', 
merge the A:-stars for which v and v' are the central 
nodes and make a new ^-star for which v is the central 
node with a set of neighbors which is the union of the 
neighbors from the ^-stars of v and v '. 

Also we should define the structural complexity 
measure SC(G)=SC(|F|,|£|,|52|,...,|5,J). Its concrete form 
depends on the application field. Here are some simple 
examples: 

1) SC(G) = |F| : i.e., the graph is considered more 
complex if it has more vertices; 



2) SC( G) = • 



l^3l,if|^3l^^™xandE|^,| = 

/t=4 ) 

00, otherwise 



i.e., only b^^ simple branches (3-stars) are 
allowed in the principal tree. 

To construct the principal tree, the following simple 
algorithm is applied: 

1) Initialize the elastic graph G by 2 vertices Vi and V2 
connected by an edge. The initial map (|) is chosen in 
such a way that ^{vi) and (t)(v2) belong to the first 
principal line in such a way that all the data points are 
projected onto the principal line segment defined by 
(t'(vi), ifivi); 

2) For a sequence of gi-ammars = {O**"""', O**""'"', 
0<'*"'*'},7=1..3, repeat steps 3-6: 

3) Apply all grammar operations from O^* to G in all 
possible ways; this gives a collection of candidate graph 
transformations {Gi, G2, ...}; 

4) Separate {G\, G2, ...} into permissible and forbidden 
transformations; permissible transformation G^ is such 
that SC(G/f) < SCmax , where SCmax is some predefined 
structural complexity ceiling; 

5) Optimize the embedding (j) and calculate the elastic 
energy {/^(G) of graph embedment for every 
permissible candidate transformation, and choose such a 
graph Gopt that gives the minimal value of the elastic 
functional: G^p^^arg inf U^{Gj^); 



Gi^ e permissibi set 



6) Substitute G ^ G, 



opt ■> 
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Genealogy tree Metro map 




Fig. 8. "Metro map" representation of normal human tissue 
samples (on the right) and the hierarchical dendrogram of the 
same data (on the left, shown only for comparison). Both 
methods approximate the data by a tree-like structure, but 
using different metaphors ("genealogy tree" in the case of 
hierarchical dendrogram and "metro map" for the harmonic 
dendrite used by the principal tree method). In both 
representations the user can estimate a distance fi^om one 
sample to another by summing up distances along the path in 
the tree. The size of circles on the metro map diagram is 
proportional to the number of points projected into the 
coiTesponding node, the pie diagram represents the 
composition of the cluster in terms of pre-existing sample 
groups (human tissues, in this case). 



7) Repeat steps 2-6 until the set of permissible 
transformations is empty or the number of operations 
exceeds a predefined number - the construction 
complexity. 

Using the 'tree trimming' grammar <9<''""'*' allows to 
produce principal trees closer to the global optimum, 
trimming excessive tree branching and fusing A:-stars 
separated by small 'bridges'. 

Principal trees can have applications in data 
visualization. A principal tree is embedded into a 
multidimensional data space. It approximates the data so 
that one can project points from the multidimensional 
space into the closest node of the tree. The tree by its 
construction is a one -dimensional object, so sthis 
projection performs dimension reduction of the 
multidimensional data. The question is how to produce 
a planar tree layout? Of course, there are many ways to 



layout a tree on a plane without edge intersection. 
But it would be useful if both local tree properties 
and global distance relations would be represented 
using the layout. We can require that 

1) In a two-dimensional layout, all A:-stars 
should be represented equiangular; this is the small 
penalty configuration; 

2) The edge lengths should be proportional to 
their length in the multidimensional embedding; thus 
one can represent between-node distances. 

This defines a tree layout up to global rotation 
and scaling and also up to changing the order of 
leaves in every k-siax. We can change this order to 
eliminate edge intersections, but the result cannot be 
guaranteed. In order to represent the global distance 
structure, it was found that a good approximation for 
the order of A:-star leaves can be taken from the 
projection of every k-sXsx on the linear principal 
plane'^ calculated for all data point or on the local 
principal plane in the vicinity of the A:-star, calculated 
only for the points close to this star. The resulting 
layout can be further optimized using some greedy 
optimization methods. 

Note that the distance on the metro map is 
estimated by summing up the lengths of branches 
along the path (Fig.8). Hence, any layout of the tree 
will not distort this information. The A:-stars are 
projected onto the principal plane in order to find a 
good ordering of nodes inside the star, as a heuristics, 
to avoid excessive tree branch intersections. 
However, this ordering does not encode the distance 
along the tree branches but rather provide a way to 
generate a nice 2D layout. 

The point projections are then represented as pie 
diagrams, where the size of the diagram reflects the 
number of points projected into the corresponding 
tree node. The sectors of the diagram allow us to 
show proportions of points of different classes projected 
into the node. An example of the metro map 
representation applied to the case of microarray dataset 
of normal tissues is shown in Fig. 8 (compare with 2D 
representation of the same dataset shown in Fig. 2). 



3.4. Principal manifolds for dynamical systems 
analysis 

Invariant manifold is a central concept in the theory 
of dynamical systems and its construction provides a 
consistent approach for model reduction^'. The general 
picture behind this concept is that starting from an 
initial condition, the dynamical system quickly reaches 
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vicinity of a curved low-dimensional manifold and 
during most of its dynamics remains close to it. 

In Ref 26 we constructed a dynamical model of 
NFkB biochemical signaling cascade. The model 
contains 35 dynamic variables representing 
concentrations of various biochemical species. This 
system is capable for sustained oscillations and 
converges to a limit cycle in the phase space. When 
studying the details of the dynamics of this system, we 
found that the dynamics in the vicinity of the limit cycle 
can be characterized by presence of a two-dimensional 
invariant manifold^'. 

Using the same methodology as described above in 
the section 2 and a phase space samphng technique, we 
have constructed the invariant manifold approximation. 
To do it, we first sample the trajectories of the 
dynamical system in the vicinity of its limit cycle in the 
following way: 

1) Using the PCA approach we found a linear 
manifold in which the cycle trajectory is embedded. 

2) A system trajectory was computed in some 
interval of time [0;?,„], started from a randomly chosen 
point of the limit cycle plus random shift in the linear 
manifold calculated at Step 1 . The size of this shift 5 
can be made up to the border of the phase space (when 
one of the concentrations becomes zero). 

3) The trajectory obtained at Step 2 was cut into two 
parts: for te[0;af„,] and te[a?„,;?,„] intervals, where 
ae[0;l]. For the sampling, only the second part of the 
trajectory was used, properly discretized. It was done to 
skip the first (fast) dynamics of the system towards a 
hypothetical invariant manifold. 

4) Steps 2 and 3 were repeated until sufficient 
(50000 in our experiments) number points were 
sampled. 

The resulting distribution of points each 
representing a 'snapshot' along the trajectory of the 
dynamical system was approximated by a non-linear 
two-dimensional principal manifold constructed using 
the elastic map approach. The resulting manifold 
projected into the subspace of the first three (linear) 
principal components is shown on the Fig. 9. The 
constructed approximation to the invariant manifold can 
be further used for model reduction by projection of the 
system dynamics onto it^^'^^. 

Principal component analysis apphed to the 
trajectories of a dynamical system sometimes called 
Karhunen-Loeve expansion^''^" and it is a useful tool for 
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Fig. 9. Two-dimensional invariant manifold (blue grid) 
approximation obtained by application of the principal 
manifold methodology to the set of "snapshots" (black points) 
along trajectories of a dynamical system. The manifold is 
shown in the 3D projection of the first three principal 
components. By red one particular trajectory is shown, 
quickly approaching the manifold and further converging 
along it to the limit cycle. 

reducing complexity of the dynamics. Hence, the 
approach presented in this example can be called the 
non-linear Karhunen-Loeve expansion. 

4. Conclusions 

Principal component analysis was introduced into 
applied science by Karl Pearson more than one hundred 
years ago^' and since then it became one of the most 
used mathematical tool in many domains of science (in 
every domain, where approximation of a finite set of 
points is required). 

Non-linear extensions of this method (Self- 
Organizing neural networks and their further 
generalizations such as principal manifolds, principal 
graphs and principal trees) also can serve as a universal 
tool allowing to approximate complex distributions of 
data points, when the linear approximation happens to 
be insufficient. To prove superiority and advantage of 
applying the non-linear approximations, one can use the 
set of benchmarking criteria described in this paper. 

The elastic graph approach can be interpreted as an 
intermediate between absolutely flexible neural gas^^ 
and significantly more restrictive Self-Organizing Maps 
and elastic maps. 

Using efficient implementation of principal graphs 
and manifolds provided by the elastic graph approach. 
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applying these methods in practice become relatively 
easy and computationally fast exercise. In this paper, we 
demonstrate this on several practical examples: from 
comparative political science, data analysis in molecular 
biology and analysis of dynamical systems for 
biochemical modeling. 
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